# Visual Reasoning
Nvidia.cosmos Reason1 7B GGUF
Cosmos-Reason1-7B is a 7B-parameter foundational model released by NVIDIA, specializing in image-to-text tasks.
Large Language Model
DevQuasar
287
1
Vilt Gqa Ft
A vision-language model based on the ViLT architecture, fine-tuned specifically for GQA visual reasoning tasks.
Text-to-Image
Transformers

phucd
62
0
VL Rethinker 7B 6bit
Apache-2.0
This is a multimodal model based on Qwen2.5-VL-7B-Instruct that supports visual question answering tasks, converted to MLX format to run efficiently on Apple silicon.
Text-to-Image
Transformers English

mlx-community
19
0
VL Rethinker 72B 8bit
Apache-2.0
This is a multimodal vision-language model converted from Qwen2.5-VL-72B-Instruct, quantized to 8 bits and suitable for visual question answering tasks.
Text-to-Image
Transformers English

mlx-community
18
0
Idefics3 8B Llama3
Apache-2.0
Idefics3 is an open-source multimodal model capable of processing arbitrary sequences of image and text inputs to generate text outputs. It shows significant improvements in OCR, document understanding, and visual reasoning.
Image-to-Text
Transformers English

HuggingFaceM4
45.86k
277
MATCHA ViChart
A chart question answering (ChartQA) model focused on extracting information from charts and answering related questions.
Text-to-Image
Transformers Other

TeeA
16
0